Predicting Boston House Prices Using Machine Learning in R


2023-02-16

Azdine Bahloul

Université Côte d’Azur

Summary

This study was carried out during my first year of the Master of Economics program at the Université Côte d’Azur as a requirement for the Big Data and Machine Learning course. The goal is to predict the median value of owner-occupied homes in Boston-area census tracts from a collection of 18 characteristics, including the crime rate, the percentage of lower-status residents, and accessibility to highways. We first explore and clean the dataset, then train and test several regression models: linear regression, decision trees, and random forests. To choose the best model, we compare their performance using adjusted R-squared and root mean squared error (RMSE). The random forest performed best, with the lowest RMSE and the highest adjusted R-squared. The exercise illustrates how machine learning can be applied in economics, here to the prediction of home prices.

Introduction

“BostonHousing2” is one of the many datasets available in R for data exploration and analysis; it ships with the mlbench package. It contains information on housing values in the Boston area, along with various characteristics of the houses and their environment. The data were collected in the 1970s for a study on air quality in cities. The dataset consists of 506 observations and 19 variables (the 14 original columns plus the 5 added in the corrected version, both listed below), including the median value of owner-occupied homes (in thousands of dollars), the proportion of residential land zoned for lots over 25,000 square feet, the nitric oxides concentration, and other socio-economic characteristics. It is commonly used in machine learning and statistical modeling projects as an example for predicting housing prices.

Data

The original data are 506 observations on 14 variables, medv being the original target variable (this study uses its corrected version, cmedv, described further down):

crim per capita crime rate by town

zn proportion of residential land zoned for lots over 25,000 sq.ft

indus proportion of non-retail business acres per town

chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

nox nitric oxides concentration (parts per 10 million)

rm average number of rooms per dwelling

age proportion of owner-occupied units built prior to 1940

dis weighted distances to five Boston employment centres

rad index of accessibility to radial highways

tax full-value property-tax rate per USD 10,000

ptratio pupil-teacher ratio by town

b 1000(B - 0.63)^2 where B is the proportion of blacks by town

lstat percentage of lower status of the population

medv median value of owner-occupied homes in USD 1000’s

The corrected data set has the following additional columns:

cmedv corrected median value of owner-occupied homes in USD 1000’s

town name of town

tract census tract

lon longitude of census tract

lat latitude of census tract
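The column counts above can be checked directly in R. The following is a minimal sketch that assumes the mlbench package is installed; it only loads the data and inspects its shape.

```r
# Load the corrected Boston housing data from mlbench.
data("BostonHousing2", package = "mlbench")

# Dimensions: rows are census tracts, columns are variables.
dim(BostonHousing2)

# Column names and types (shows which columns are factors).
str(BostonHousing2)

# Number of tracts whose median value was changed by the correction.
n_corrected <- sum(BostonHousing2$medv != BostonHousing2$cmedv)
n_corrected
```

Comparing medv with cmedv in this way shows how many observations the corrected version actually changes.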

Exploratory Analysis

Data preprocessing: cleaning and validation

data("BostonHousing2", package = "mlbench")
# Removal of unnecessary columns for our study
BostonHousing2 … %>% cor() %>% corrplot()
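The preprocessing and correlation plot could look roughly like the sketch below. The set of dropped columns (`drop_cols`) and the data frame name `boston` are assumptions made for illustration, since the text does not say which columns were removed; the sketch also assumes the mlbench and corrplot packages are installed.

```r
library(mlbench)   # provides BostonHousing2
library(corrplot)  # provides corrplot()

data("BostonHousing2", package = "mlbench")

# ASSUMPTION: identifiers, coordinates and the uncorrected target are the
# "unnecessary" columns; the original selection is not known.
drop_cols <- c("town", "tract", "lon", "lat", "medv")
boston <- BostonHousing2[, !(names(BostonHousing2) %in% drop_cols)]

# cor() needs numeric input, so convert the chas factor to 0/1 values.
boston$chas <- as.numeric(as.character(boston$chas))

# Basic validation: no missing values and only numeric columns remain.
stopifnot(!anyNA(boston), all(sapply(boston, is.numeric)))

# Correlation matrix and its plot.
cor_mat <- cor(boston)
corrplot(cor_mat)
```

Keeping the cleaned data in a separate object leaves the original `BostonHousing2` untouched, which makes it easy to try a different column selection.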

The chart may look hard to interpret, but it is not. The darker the circle, the stronger the correlation, and its color indicates whether the relationship is positive or negative. Focusing on cmedv, few variables are highly correlated with it: rm shows a clear positive correlation, while ptratio and lstat are negatively correlated, lstat very strongly so.

pairs(~ cmedv + ptratio + tax + lstat + dis + rm + crim, data = BostonHousing2, main = "Boston Data")

lstat, dis and rm show a roughly linear relationship with cmedv, whereas variables like crim do not: their relationship with cmedv is more complicated.
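One rough way to put numbers on this visual impression is to compare the Pearson correlation, which measures linear association, with the Spearman rank correlation, which measures monotonic association. This is a sketch rather than part of the original analysis: a Spearman value clearly larger in magnitude than the Pearson value hints that a relationship is monotonic but not linear.

```r
data("BostonHousing2", package = "mlbench")

# Same variables as in the scatterplot matrix.
vars <- c("ptratio", "tax", "lstat", "dis", "rm", "crim")

# Linear (Pearson) and rank-based (Spearman) correlation with cmedv.
pearson <- sapply(vars, function(v) {
  cor(BostonHousing2[[v]], BostonHousing2$cmedv)
})
spearman <- sapply(vars, function(v) {
  cor(BostonHousing2[[v]], BostonHousing2$cmedv, method = "spearman")
})

# A positive gap means the rank correlation is stronger than the linear one.
comparison <- data.frame(pearson, spearman,
                         gap = abs(spearman) - abs(pearson))
round(comparison, 2)
```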

library(ggplot2)

ggplot(BostonHousing2, aes(x = cmedv)) +
  geom_histogram(aes(y = after_stat(density)), fill = "steelblue", alpha = 0.7, color = "black") +
  geom_density(color = "red", position = "identity") +
  labs(x = "Median value", y = "Density") +
  ggtitle("Distribution of CMEDV") +
  theme(plot.title = element_text(hjust = 0.5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The density is represented by the red curve. The distribution of median home values is skewed to the right, with several outliers at the high end.
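The skew and the outliers can also be checked numerically. The sketch below uses base R only; the moment-based skewness formula and Tukey's 1.5 × IQR rule are choices made here for illustration, not something the original analysis specifies.

```r
data("BostonHousing2", package = "mlbench")
x <- BostonHousing2$cmedv

# Moment-based sample skewness: positive values indicate a right skew.
centered <- x - mean(x)
skewness <- mean(centered^3) / mean(centered^2)^1.5

# Tukey's rule: points more than 1.5 * IQR beyond the quartiles.
outliers <- boxplot.stats(x)$out

skewness
length(outliers)
range(outliers)
```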

Two checks are important at this point: how strongly each input feature correlates with the dependent variable, and whether any feature has near-zero variance (that is, its values barely vary within the column).

correlations
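A minimal base-R sketch of these two checks is given below. The helper `near_zero_var()` and its thresholds are hypothetical; they are modeled loosely on the defaults of `caret::nearZeroVar()` and are not taken from the original code, which is not available here.

```r
data("BostonHousing2", package = "mlbench")

# Keep numeric columns only (this drops the factor columns).
num <- BostonHousing2[, sapply(BostonHousing2, is.numeric)]

# Correlation of each numeric feature with the target, strongest first.
# medv is excluded because it is the uncorrected copy of the target.
features <- setdiff(names(num), c("cmedv", "medv"))
correlations <- cor(num[, features], num$cmedv)[, 1]
correlations[order(abs(correlations), decreasing = TRUE)]

# Simplified near-zero-variance screen (hypothetical helper): flag a column
# when its most common value dominates the second most common one and the
# column has few distinct values relative to its length.
near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)  # constant column
  freq_ratio <- counts[[1]] / counts[[2]]
  percent_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && percent_unique <= unique_cut
}

# TRUE marks a column that barely varies.
sapply(BostonHousing2, near_zero_var)
```

Features flagged by the second check carry little information for a regression model and are natural candidates for removal.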

